Estimating GPU Speedups for Programs Without Writing a Single Line of GPU Code

Authors

  • Newsha Ardalani
  • Karthikeyan Sankaralingam
  • Xiaojin Zhu
Abstract

Heterogeneous processing using GPUs is here to stay and today spans mobile devices, laptops, and supercomputers. Although modern software development frameworks like OpenCL and CUDA provide a high-productivity environment, software development for GPUs remains time consuming. First, much work is needed to restructure software and data organization to match the GPU's many-threaded programming model. Second, code optimization is quite time consuming, and performance analysis tools require significant expertise to use effectively. Third, until the final optimized code has been derived, it is almost impossible today to know what performance advantage porting a code to a GPU will provide. This paper focuses on this last question and seeks to develop an automated "performance prediction" tool that can provide an accurate estimate of GPU speedup when given a piece of CPU code, prior to developing the GPU code. Our paper is built on two insights: i) ultimately, the speedup a piece of code achieves on a GPU depends on fundamental, microarchitecture-independent program properties such as available parallelism, branching behavior, etc.; and ii) by examining a vast array of previously implemented GPU codes along with their CPU counterparts, we can use machine learning to learn this correlation between program properties and GPU speedup. In this paper, we use linear regression, specifically a technique inspired by regularized regression, to build a model for GPU speedup prediction. When applied to never-before-seen test data selected randomly from the Rodinia, Parboil, Lonestar, and Parsec benchmark suites (speedup range of 5.9× to 276×), our tool makes accurate predictions with an average weighted error of 32%. Our technique is also robust: the errors remain similar across the other "unseen" GPU platforms we test on. Essentially, we deliver an automated tool that programmers can use to estimate potential GPU speedup before writing any GPU code.
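
As a rough illustration of the modeling approach the abstract describes (and not the paper's actual tool, feature set, or data), the sketch below fits a regularized linear model that maps microarchitecture-independent properties measured from CPU code to observed GPU speedups on a training corpus, then predicts the speedup of an unseen program. The feature names, the toy numbers, and the use of scikit-learn's Lasso are all assumptions made for the example.

```python
# Minimal sketch, assuming scikit-learn; features and values are invented
# placeholders, not the paper's measured program properties.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training corpus: one row per program that already has both a
# CPU and a GPU implementation. Columns: available parallelism, branch
# divergence, memory intensity (illustrative microarchitecture-independent
# properties measured from the CPU code alone).
X_train = np.array([
    [0.92, 0.10, 0.35],
    [0.40, 0.55, 0.70],
    [0.85, 0.20, 0.15],
    [0.30, 0.60, 0.80],
])
speedup_train = np.array([140.0, 8.5, 95.0, 6.2])  # measured GPU speedups

# Fit a regularized (L1) linear model on log(speedup) so a wide speedup
# range is handled on a multiplicative scale.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X_train, np.log(speedup_train))

# Predict for a program whose GPU port has not been written yet, using only
# properties extracted from its CPU code.
x_new = np.array([[0.78, 0.25, 0.30]])
estimated_speedup = float(np.exp(model.predict(x_new))[0])
print(f"Estimated GPU speedup: {estimated_speedup:.1f}x")
```

The toy corpus above only shows the shape of the workflow; in the paper the model is trained on many already-ported benchmarks and evaluated by a weighted error over held-out test programs.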

Similar articles

Accelerating high-order WENO schemes using two heterogeneous GPUs

A double-GPU code is developed to accelerate WENO schemes. The test problem is a compressible viscous flow. The convective terms are discretized using third- to ninth-order WENO schemes, and the viscous terms are discretized with the standard fourth-order central scheme. The code, written in the CUDA programming language, is developed by modifying a single-GPU code. The OpenMP library is used for parall...
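
As a small aside on the discretization just mentioned, the snippet below is a minimal NumPy sketch of a standard fourth-order central scheme (second derivative) on a 1-D periodic grid; it illustrates that one building block only and is not a reconstruction of the double-GPU WENO solver.

```python
# Fourth-order central stencil for the second derivative:
#   f''(x_i) ~ (-f[i-2] + 16 f[i-1] - 30 f[i] + 16 f[i+1] - f[i+2]) / (12 h^2)
import numpy as np

def d2_central4(f, h):
    """Fourth-order central second derivative on a periodic 1-D grid."""
    return (-np.roll(f, 2) + 16 * np.roll(f, 1) - 30 * f
            + 16 * np.roll(f, -1) - np.roll(f, -2)) / (12 * h * h)

x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
h = x[1] - x[0]
f = np.sin(x)
# Exact second derivative of sin(x) is -sin(x); print the max error.
print(np.max(np.abs(d2_central4(f, h) + np.sin(x))))
```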

Parallelization of Rich Models for Steganalysis of Digital Images using a CUDA-based Approach

There are several different methods for building an efficient steganalysis strategy for digital images. A very powerful method in this area is the rich model, consisting of a large number of diverse sub-models in both the spatial and transform domains that should be utilized. However, the extraction of various types of features from an image is quite time consuming in some steps, especially for the training pha...

Parallel Implementation of Particle Swarm Optimization Variants Using Graphics Processing Unit Platform

There are different variants of the Particle Swarm Optimization (PSO) algorithm, such as Adaptive Particle Swarm Optimization (APSO) and Particle Swarm Optimization with an Aging Leader and Challengers (ALC-PSO). These algorithms improve the performance of PSO in terms of finding the best solution and accelerating convergence. However, these algorithms are computationally intensive. The go...

Multi-Stage Programming for GPUs in Modern C++ using PACXX

Writing and optimizing programs for high performance on systems with GPUs remains a challenging task even for expert programmers. One promising optimization technique is to evaluate parts of the program upfront on the CPU and embed the computed results in the GPU code allowing for more aggressive compiler optimizations. This technique is known as multi-stage programming and has proven to allow ...
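
To make the multi-stage idea concrete in a language-agnostic way (this is not PACXX's C++ API, just an illustration of the concept), the sketch below evaluates everything that is already known in a first, host-side stage and returns a specialized function whose baked-in constants a staging compiler could fold and unroll.

```python
# Illustrative two-stage specialization: stage 1 precomputes constants,
# stage 2 is the specialized "kernel" that uses them.
import math

def stage1_build_gaussian_blur(sigma, radius):
    """Host-side stage: precompute normalized filter weights once."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    norm = sum(weights)
    weights = [w / norm for w in weights]

    def stage2_blur(signal):
        """Specialized stage: the weights are constants here, so a staging
        compiler could unroll this loop and fold them into the code."""
        out = []
        for i in range(len(signal)):
            acc = 0.0
            for k, w in enumerate(weights, start=-radius):
                j = min(max(i + k, 0), len(signal) - 1)  # clamp at the edges
                acc += w * signal[j]
            out.append(acc)
        return out

    return stage2_blur

blur = stage1_build_gaussian_blur(sigma=1.5, radius=2)  # staged once
print(blur([0.0, 1.0, 0.0, 0.0, 4.0, 0.0]))
```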

Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance

Developing high-performance GPU code is labor intensive. Ideally, developers could recoup high GPU development costs by generating high-performance programs for CPUs and other architectures from the same source code. However, current OpenCL compilers for non-GPU targets do not fully exploit optimizations in well-tuned GPU codes. To address this problem, we develop an OpenCL implementation that efficie...

Publication year: 2014